Mining, Classification, Prediction

Rebecca Gu rjg2836

Introduction

This dataset is the State-Level Education and Voter Turnout in 2016. It has a total of 14 variables: 8 numeric variables, two binary, two categorical variables, a state identifier variable, and the year variable which is the same for all observations. It contains the year of election (year), the state abbreviation variables (state), the state’s Census region (region), the state’s Census division (division), voter turnout for the highest office as percent of voting-eligible population (turnoutho), the percentage of the state that completed high school (pershed), the percentage of the state that completed college (percoled), an estimate of the state’s GDP per capita (gdppercap), if it is a swing state (ss), if Trump won the state (trumpw), the share of the vote Trump received (trumpshare), the state-level unemployment rate entering Nov. 2016 (sunempr), the state-level unemployment rate (12-month difference) entering Nov. 2016 (sunempr12md), and an estimate of the state’s GDP (gdp).

I found the data from stevedata, and it is a data frame with 51 observations of the 13 variables (excluding year). For the region variables, there are 12 states in North Central, 9 states in the Northeast, 17 states in the South, and 13 states in the West. For the divisions, 5 are in East North Central, 4 are in East South Central, 3 in Middle Atlantic, 8 in Mountain, 6 in New England, 5 in Pacific, 9 in South Atlantic, 7 in West North Central, and 4 in West South Central. For the binary vairbale trumpw, 30 states are 1, 21 states are 0. For the swing state variable, 39 states are not, and 12 states are.

library(tidyverse)
library(stevedata)
election_turnout
## # A tibble: 51 x 14
##     year state region division turnoutho perhsed percoled gdppercap    ss trumpw
##    <dbl> <chr> <chr>  <chr>        <dbl>   <dbl>    <dbl>     <int> <int>  <int>
##  1  2016 Alab… South  East So…      59      84.3     23.5     42663     0      1
##  2  2016 Alas… West   Pacific       61.3    92.1     28       81801     0      1
##  3  2016 Ariz… West   Mountain      55      86       27.5     43269     0      1
##  4  2016 Arka… South  West So…      52.8    84.8     21.1     41129     0      1
##  5  2016 Cali… West   Pacific       56.7    81.8     31.4     61924     0      0
##  6  2016 Colo… West   Mountain      70.1    90.7     38.1     58009     1      0
##  7  2016 Conn… North… New Eng…      65.2    89.9     37.6     72331     0      0
##  8  2016 Dela… South  South A…      64.4    88.4     30       69930     0      0
##  9  2016 Dist… South  South A…      60.9    89.3     54.6    181185     0      0
## 10  2016 Flor… South  South A…      64.6    86.9     27.3     42595     1      1
## # … with 41 more rows, and 4 more variables: trumpshare <dbl>, sunempr <dbl>,
## #   sunempr12md <dbl>, gdp <dbl>
election_turnout %>% group_by(division) %>% count
## # A tibble: 9 x 2
## # Groups:   division [9]
##   division               n
##   <chr>              <int>
## 1 East North Central     5
## 2 East South Central     4
## 3 Middle Atlantic        3
## 4 Mountain               8
## 5 New England            6
## 6 Pacific                5
## 7 South Atlantic         9
## 8 West North Central     7
## 9 West South Central     4
election_turnout %>% group_by(region) %>% count
## # A tibble: 4 x 2
## # Groups:   region [4]
##   region            n
##   <chr>         <int>
## 1 North Central    12
## 2 Northeast         9
## 3 South            17
## 4 West             13
election_turnout %>% group_by(trumpw) %>% count
## # A tibble: 2 x 2
## # Groups:   trumpw [2]
##   trumpw     n
##    <int> <int>
## 1      0    21
## 2      1    30
election_turnout %>% group_by(ss) %>% count
## # A tibble: 2 x 2
## # Groups:   ss [2]
##      ss     n
##   <int> <int>
## 1     0    39
## 2     1    12

Cluster Analysis

library(cluster)
pam_dat <- election_turnout %>% dplyr::select(turnoutho, trumpshare, 
    percoled) %>% scale
sil_width <- vector()  #empty vector to hold mean sil width
for (i in 2:10) {
    pam_fit <- pam(pam_dat, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k", 
    breaks = 1:10)

# k=2 has the highest sil width
pam1 <- pam_dat %>% pam(k = 2)

library(plotly)
final <- election_turnout %>% select(-ss, -trumpw, -year, -state, 
    -division) %>% mutate(cluster = as.factor(pam1$clustering))
final %>% plot_ly(x = ~sunempr, y = ~gdppercap, z = ~percoled, 
    color = ~cluster, type = "scatter3d", mode = "markers", symbol = ~region, 
    symbols = c("circle", "x", "o"))
library(GGally)
ggpairs(final, columns = 2:9, aes(color = cluster))

pam1$silinfo$avg.width
## [1] 0.3775957
plot(pam1, which = 2)

election_turnout %>% slice(pam1$id.med)
## # A tibble: 2 x 14
##    year state region division turnoutho perhsed percoled gdppercap    ss trumpw
##   <dbl> <chr> <chr>  <chr>        <dbl>   <dbl>    <dbl>     <int> <int>  <int>
## 1  2016 Sout… South  South A…      56.7    85.6     25.8     40212     0      1
## 2  2016 Wash… West   Pacific       64.8    90.4     32.9     62213     0      0
## # … with 4 more variables: trumpshare <dbl>, sunempr <dbl>, sunempr12md <dbl>,
## #   gdp <dbl>

I first chose the three variables turnoutho,trumpshare, and percoled and then conducted a PAM analysis of k=2 because it had the highest average silhouette width of .377. The two clusters were similar in both state level unemployment rate measurements, but differed the most in percentage of the state that completed high school (pershed) and voter turnout for the highest office as percent of voting-eligible population (turnoutoh). These two clusters’ mediods are South Carolina and Washington, and the most prominent distinction is if Trump won the state or not in the 2016 election. The cluster with South Carolina had lower percentage of completed high school and completed college (pershed/percoled), lower GDP per cap (gdppercap), lower turnout rates (turnoutoh), and higher share of the vote Trump received (trumpshare). The interpretation of the average silhouette width is that the structure is weak and could be artificial.

Dimensionality Reduction with PCA

pca_dat <- election_turnout %>% dplyr::select(-ss, -trumpw, -year, 
    -state, -division, -region, -sunempr12md) %>% scale
pcac <- princomp(pca_dat, cor = T)
summary(pcac, loading = T)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.6741666 1.4003838 0.9412238 0.8398521 0.5872072
## Proportion of Variance 0.4004048 0.2801535 0.1265575 0.1007645 0.0492589
## Cumulative Proportion  0.4004048 0.6805584 0.8071158 0.9078803 0.9571392
##                            Comp.6     Comp.7
## Standard deviation     0.46411810 0.29089462
## Proportion of Variance 0.03077223 0.01208853
## Cumulative Proportion  0.98791147 1.00000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## turnoutho   0.348  0.251  0.334  0.769         0.283  0.174
## perhsed     0.359  0.501               -0.272 -0.733       
## percoled    0.565 -0.132        -0.116         0.216 -0.773
## gdppercap   0.438 -0.270 -0.460        -0.548  0.248  0.399
## trumpshare -0.462  0.321         0.177 -0.640  0.270 -0.402
## sunempr    -0.161 -0.500 -0.405  0.600        -0.386 -0.217
## gdp               -0.492  0.707        -0.453 -0.228
eigval <- pcac$sdev^2  #square to convert SDs to eigenvalues
varprop = round(eigval/sum(eigval), 2)  #proportion of var explained by each PC

ggplot() + geom_bar(aes(y = varprop, x = 1:7), stat = "identity") + 
    xlab("") + geom_path(aes(y = varprop, x = 1:7)) + geom_text(aes(x = 1:7, 
    y = varprop, label = round(varprop, 2)), vjust = 1, col = "white", 
    size = 5) + scale_y_continuous(breaks = seq(0, 0.6, 0.2), 
    labels = scales::percent) + scale_x_continuous(breaks = 1:10)

library(factoextra)
fviz_pca_biplot(pcac)

To score high on PC1 means that the state has a low trumpshare, and a low state state-level unemployment rate entering Nov. 2016 (sunempr). The state has high voter turnout, percent of state that completed both high school and college, and also has a high GDP per capita. If the state has a low PC1 score, it is the opposite. This PC focuses on high educational attainments vs trumpshare. For PC2, a high score means a high percentage of high school completion, a high share of Trump supporters, and a large voter turnout. The high PC2 scores will have minimal unemployment rates, GDP related variables, and percent of college graduates. This PC seems to focus on unemployment rates vs percent high school completion. If the state has a low PC2 score, it is the opposite. Finally, for PC3, there are only four variables. To score high on PC3 means having a high state GDP and voter turnout, but low or minimal GDP per capita and unemployment rates. Scoring low is the opposite- high GDP per capita and high unemployment rates, but low GDP and voter turnout. This PC focuses on GDP vs GDP per capita. With the 3 principal components, 80.71% of the total variance in the dataset is explained.

Linear Classifier

numerics <- election_turnout %>% dplyr::select(-ss, -year, -state, 
    -division, -region, )
logistic_fit <- glm(trumpw == "True" ~ ., data = numerics, family = "binomial")
prob_reg <- predict(logistic_fit, type = "response")
class_diag(prob_reg, truth = election_turnout$trumpw, positive = 1)
##            acc sens spec ppv  f1  ba    auc
## Metrics 0.4118    0    1 NaN NaN 0.5 0.4794
table(truth = factor(election_turnout$trumpw == 1, levels = c("TRUE", 
    "FALSE")), prediction = factor(prob_reg > 0.5, levels = c("TRUE", 
    "FALSE"))) %>% addmargins
##        prediction
## truth   TRUE FALSE Sum
##   TRUE     0    30  30
##   FALSE    0    21  21
##   Sum      0    51  51
set.seed(10)
k = 10

data <- sample_frac(numerics)  #randomly order rows
folds <- rep(1:k, length.out = nrow(data))  #create folds

diags <- NULL

i = 1
for (i in 1:k) {
    # create training and test sets
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$trumpw
    
    # train model
    fit <- glm(trumpw == "True" ~ ., data = train, family = "binomial")
    
    # test model
    probs <- predict(fit, newdata = test, type = "response")
    
    # get performance metrics for each fold
    diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
# average performance metrics across all folds
summarize_all(diags, mean)
##    acc sens spec ppv  f1  ba     auc
## 1 0.41    0    1 NaN NaN 0.5 0.54722

The logistic regression model has a low scoring AUC of .479, which is worse than a 50-50 chance of guessing. The accuracy is also very low. The confusion matrix shows that the model predicts false for all states. However, k-fold CV predicts more effectively than the logistic regression. After setting the seed so that the output is constant, the AUC is .547. I do not see the typical signs of overfitting, in fact, it is just the opposite since the CV tests perform better.

Non-Parametric Classifier

library(caret)
knn_fit <- knn3(factor(trumpw == 1, levels = c("TRUE", "FALSE")) ~ 
    ., data = numerics)
prob_knn <- predict(knn_fit, numerics)
class_diag(prob_knn[, 1], election_turnout$trumpw, positive = 1)
##            acc sens   spec    ppv     f1     ba    auc
## Metrics 0.6667  0.8 0.4762 0.6857 0.7385 0.6381 0.7278
table(truth = factor(election_turnout$trumpw == 1, levels = c("TRUE", 
    "FALSE")), prediction = factor(prob_knn[, 1] > 0.5, levels = c("TRUE", 
    "FALSE"))) %>% addmargins
##        prediction
## truth   TRUE FALSE Sum
##   TRUE    24     6  30
##   FALSE   11    10  21
##   Sum     35    16  51
set.seed(10)
k = 10  #choose number of folds
data <- sample_frac(numerics)  #randomly order rows
folds <- rep(1:k, length.out = nrow(data))  #create folds
diags <- NULL
for (i in 1:k) {
    ## Create training and test sets
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$trumpw  ## Truth labels for fold i
    ## Train model on training set (all but fold i)
    fit <- knn3(trumpw ~ ., data = train)
    ## Test model on test set (fold i)
    probs <- predict(fit, newdata = test)[, 2]
    ## Get diagnostics for fold i
    diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
summarize_all(diags, mean)
##       acc    sens spec     ppv      f1      ba     auc
## 1 0.47333 0.68334 0.25 0.57334 0.57888 0.46666 0.46526

The kNN model does much better than logistic regression, but still gets a ‘C’ score. After CV though, the AUC drops to .46 and makes it a “bad” model. I see signs of overfitting because of the drastic change in AUC. This nonparametric model does worse with the cross-validation performance when compared to the logistic regression.

Regression/Numeric Prediction

fit <- lm(trumpshare ~ percoled + gdppercap, data = election_turnout)
yhat <- predict(fit)
mean((election_turnout$trumpshare - yhat)^2)
## [1] 0.006015025
set.seed(10)
k = 5  #choose number of folds
data <- election_turnout[sample(nrow(election_turnout)), ]  #randomly order rows
folds <- cut(seq(1:nrow(election_turnout)), breaks = k, labels = F)  #create folds
diags <- NULL
for (i in 1:k) {
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    ## Fit linear regression model to training set
    fit <- lm(trumpshare ~ percoled + gdppercap, data = train)
    ## Get predictions/y-hats on test set (fold i)
    yhat <- predict(fit, newdata = test)
    ## Compute prediction error (MSE) for fold i
    diags <- mean((test$trumpshare - yhat)^2)
}
mean(diags)
## [1] 0.004075225

There are no signs of overfitting, in fact the model does better when it is cross-validated. However, overall both models perform exceptionally well with an almost near 0 MSE. The CV is only .002 better than the linear regression model fitted to the entire dataset.

Python

library(reticulate)
word1 <- "you"

word2 <- "all"
equal <- "="
word3 = "y'all"
print(r.word1 + " "+  r.word2)
## you all
cat(c(word1, word2, equal, py$word3))
## you all = y'all

I first declared some words in the R and python chunk, then I printed in the python code using the words from the R chunk and referred to them with r.varname. Then, at the last R chunk I combined code from the first R chunk and the python chunk (using py$varname) to make a phrase.

Concluding Remarks

This dataset seems to have its quirks with the CV performing better than when the model is trained to the entire dataset. There are no strong assumptions that can be made from this model due to its low AUC and small average silhouette width.